Skip to content

fix: SDK-4475 wait for in-flight init in initWithContextSuspend to avoid SessionService NPE#2637

Open
abdulraqeeb33 wants to merge 3 commits intomainfrom
ar/sdk-4475
Open

fix: SDK-4475 wait for in-flight init in initWithContextSuspend to avoid SessionService NPE#2637
abdulraqeeb33 wants to merge 3 commits intomainfrom
ar/sdk-4475

Conversation

@abdulraqeeb33
Copy link
Copy Markdown
Contributor

Description

One Line Summary

Closes a race under SDK_BACKGROUND_THREADING where SyncJobService re-entering the internal suspend initWithContext returned true before bootstrap finished, causing SessionService.backgroundRun() to NPE on a still-null session field.

Linear: SDK-4475

Details

Motivation

Production stack (only seen with SDK_BACKGROUND_THREADING enabled):

java.lang.NullPointerException
    at com.onesignal.session.internal.session.impl.b.endSession(Unknown Source:2)
    at com.onesignal.session.internal.session.impl.b.backgroundRun(Unknown Source:0)
    at com.onesignal.core.internal.background.impl.a$b$a.invokeSuspend(Unknown Source:74)
    ...
    at com.onesignal.core.internal.background.impl.a.runBackgroundServices(Unknown Source:6)
    at com.onesignal.core.services.SyncJobService$a.invokeSuspend(Unknown Source:104)

The race:

  1. Host app calls OneSignal.initWithContext(context, appId). Under the FF this flips initState to IN_PROGRESS and runs internalInit on a fire-and-forget IO coroutine, then returns immediately.
  2. While that coroutine is still inside internalInit (before bootstrapServices() has called SessionService.bootstrap()), SyncJobService.onStartJob fires on a separate IO worker.
  3. SyncJobService calls OneSignal.initWithContext(this) (the internal suspend overload, no-appId).
  4. initWithContextSuspend saw initState.isSDKAccessible() == true (because IN_PROGRESS is "accessible") and returned true without waiting for bootstrap to finish — violating its own documented contract: "Remain suspend until initialization is fully completed."
  5. SyncJobService proceeded to OneSignal.getService<IBackgroundManager>().runBackgroundServices(), which loops through IBackgroundServices and calls backgroundRun() on SessionService whose session field was still null. NPE on if (!session!!.isValid).

With FF off, public initWithContext runs runBlocking { internalInit(...) } synchronously — bootstrap is always finished before init returns. The window doesn't exist.

Scope

  • OneSignalImp.initWithContextSuspend — when init is already in flight, now suspends on suspendCompletion.await() until it actually completes, then returns based on the final state (SUCCESS → true, FAILED → false). Honors the documented contract.
  • Public sync initWithContext(context, appId) — intentionally unchanged. The fire-and-forget under FF-on is the entire point of the FF (no ANR on host app's MainApplication.onCreate()). The wait belongs on the suspend re-entry path, where the caller (SyncJobService) is already on a background coroutine and depends on the SDK being fully ready for its very next line.
  • SessionService — defensive null guards in endSession / onFocus / onUnfocused / startTime / scheduleBackgroundRunIn so any future caller that bypasses bootstrap() no-ops instead of crashing here.

Wait duration

Bounded by actual internalInit cost — initEssentials + bootstrapServices + resolveAppId + userSwitcher.initUser. Tens to hundreds of ms in practice, well within JobService budgets.

Relationship to #2605

#2605 addresses the same general class (callers acting while IN_PROGRESS) but at the accessor side (getServiceWithFeatureGate blocks on IN_PROGRESS). It does not close this NPE because SyncJobService reaches IBackgroundManager via raw getService<T>() (IServiceProvider), which doesn't go through the accessor gate. The two fixes are complementary; merge order is independent.

Testing

Unit testing

Added SDKInitTests."initWithContextSuspend with in-flight init waits for completion before returning":

  • Stalls a first initWithContextSuspend inside internalInit via a custom BlockingPrefsContext that signals (via CountDownLatch) when getSharedPreferences is first touched — by which point initState is deterministically IN_PROGRESS.
  • Kicks off a second initWithContextSuspend and captures os.isInitialized at the exact moment that second call returns.
  • Asserts os.isInitialized == true at return — i.e. the second call did not return until init was fully complete.

Verified locally:

  • Without the fix (OneSignalImp.kt reverted): test FAILS with expected:<true> but was:<false> — exactly catches the bug.
  • With the fix: test PASSES.

Full :OneSignal:core:testReleaseUnitTest run: 801 tests, only the 2 unrelated pre-existing SDKInitTests failures remain on main (those are what #2605 addresses); confirmed pre-existing by stashing this PR and re-running. All SDKInitSuspendTests (11) and SessionListenerTests (3) pass.

Manual testing

Not directly reproducible at will without instrumentation (race window depends on JobService timing). Repro path is documented above; the regression test deterministically reproduces the contract violation.

Affected code checklist

  • Notifications
    • Display
    • Open
    • Push Processing
    • Confirm Deliveries
  • Outcomes
  • Sessions
  • In-App Messaging
  • REST API requests
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing
  • Any Public API changes are explained in the PR details and conform to existing APIs (no public API changes)

Testing

  • I have included test coverage for these changes, or explained why they are not needed
  • All automated tests pass, or I explained why that is not possible (only the 2 pre-existing SDKInitTests failures remain — unrelated, addressed by fix: resolve pre-existing test failures on main #2605)
  • I have personally tested this on my device, or explained why that is not possible — race window depends on JobService timing; the regression test deterministically reproduces the contract violation instead

Final pass

  • Code is as readable as possible.
  • I have reviewed this PR myself, ensuring it meets each checklist item

Made with Cursor

…oid SessionService NPE

Under SDK_BACKGROUND_THREADING, the public initWithContext(context, appId)
runs internalInit() on a fire-and-forget IO coroutine. The internal-only
suspend overload initWithContext(context) -- used by SyncJobService.onStartJob
-- was returning true as soon as initState was IN_PROGRESS, even though
bootstrap() had not yet populated SessionService.session. SyncJobService
would then call runBackgroundServices(), which invokes
SessionService.backgroundRun() -> endSession() and NPEs on session!!.isValid.

Fix the suspend overload to honor its documented contract ("Remain suspend
until initialization is fully completed"): when init is already in flight,
suspend on suspendCompletion.await() until it actually completes, then
return based on the final state. The public sync overload is unchanged --
host app's MainApplication.onCreate() still returns immediately, no ANR
risk re-introduced.

Also add defensive null guards to SessionService.endSession / onFocus /
onUnfocused / startTime / scheduleBackgroundRunIn so any future caller
that bypasses bootstrap() no-ops instead of crashing.

Adds a regression test (SDKInitTests) that stalls a first
initWithContextSuspend inside internalInit (via a custom
BlockingPrefsContext that signals when prefs are first touched), kicks off
a second initWithContextSuspend, and asserts that os.isInitialized is true
at the moment the second call returns. Verified to fail on main and pass
with the fix.

Co-authored-by: Cursor <cursoragent@cursor.com>
Copilot AI review requested due to automatic review settings May 6, 2026 18:49
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a race condition under SDK_BACKGROUND_THREADING where a re-entrant suspend initialization (initWithContextSuspend) could return before bootstrap completed, allowing background services (notably SessionService) to run with uninitialized/null internal models.

Changes:

  • Updated OneSignalImp.initWithContextSuspend to await in-flight initialization completion before returning a result.
  • Added a regression test that reproduces the re-entrant init race and asserts the suspend init does not return early.
  • Added defensive null-guards in SessionService to avoid NPEs when invoked before bootstrap().

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt Makes re-entrant suspend init wait for completion instead of returning during IN_PROGRESS.
OneSignalSDK/onesignal/core/src/main/java/com/onesignal/session/internal/session/impl/SessionService.kt Adds pre-bootstrap null handling to prevent crashes when background services run too early.
OneSignalSDK/onesignal/core/src/test/java/com/onesignal/core/internal/application/SDKInitTests.kt Adds a regression test to catch early-return behavior during in-flight initialization.
Comments suppressed due to low confidence (1)

OneSignalSDK/onesignal/core/src/main/java/com/onesignal/internal/OneSignalImp.kt:662

  • If internalInit(...) throws (or any unexpected exception occurs before notifyInitComplete()), initState can remain IN_PROGRESS and suspendCompletion may never complete. With the new suspendCompletion.await() path, this can hang re-entrant init callers indefinitely. Wrap the init execution so initState is set to FAILED and the completion signal is completed in a finally block (capturing the exception into initFailureException as appropriate).
            if (!shouldRunInit) {
                // Another caller has already started (or completed) init. Honor this method's
                // contract by suspending until initialization is *fully* completed -- not just
                // kicked off. This closes a race where re-entrant suspend callers (e.g. the
                // SyncJobService entry point under SDK_BACKGROUND_THREADING) would otherwise
                // proceed to use IBackgroundService implementations like SessionService whose
                // bootstrap() had not yet run, NPE'ing on still-null model fields.
                Logging.log(LogLevel.DEBUG, "initWithContext: init already in progress or completed, awaiting completion")
                suspendCompletion.await()
                return@withContext initState == InitState.SUCCESS
            }

            val result = internalInit(context, appId)
            // initState is already set correctly in internalInit, no need to overwrite it
            result

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +512 to +516
val blockingCtx = object : ContextWrapper(context) {
override fun getSharedPreferences(name: String, mode: Int): SharedPreferences {
started.countDown()
trigger.await()
return super.getSharedPreferences(name, mode)
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — addressed in 524c45b. Bounded trigger.await(30, TimeUnit.SECONDS) inside the ContextWrapper.getSharedPreferences() overrides for both this test and the original in-flight init test, so a logic bug or cancelled coroutine can't leave the IO worker blocking forever and stalling the suite.

Comment on lines +547 to +552
// Sanity: the second caller has not pre-empted the test by returning before
// we unblock the first caller (timing depends on lazy ServiceProvider locks).
Thread.sleep(200)

// Unblock the first caller so internalInit() can complete (state -> SUCCESS).
trigger.countDown()
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. The 200ms sleep is intentionally testing a negative assertion — "the second caller did NOT return within window X" — which collapses to the same wall-clock semantics whether expressed as Thread.sleep, delay(), or polling, since polling for an event that should never fire is just a sleep with extra steps. The flake risk is only on the upper bound (we picked 200ms, the actual window can be much smaller), and the bounded trigger.await(30s) from 524c45b ensures the test fails-fast rather than hanging if 200ms is ever insufficient.

Happy to switch to delay() for stylistic consistency with coroutine-based tests if you'd prefer — the semantic is identical.

) : ISessionService, IBootstrapService, IStartableService, IBackgroundService, IApplicationLifecycleHandler {
override val startTime: Long
get() = session!!.startTime
get() = session?.startTime ?: 0L
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 524c45b. startTime now defaults to _time.currentTimeMillis when bootstrap hasn't run, so call sites computing now - startTime (IAM session-duration in InAppMessagesManager and SESSION_TIME triggers in DynamicTriggerController) see ~0ms elapsed instead of ~58 years (Jan 1970).

Comment on lines +640 to +646
val shouldRunInit: Boolean
synchronized(initLock) {
if (initState.isSDKAccessible()) {
Logging.log(LogLevel.DEBUG, "initWithContext: SDK already initialized or in progress")
return@withContext true
shouldRunInit = !initState.isSDKAccessible()
if (shouldRunInit) {
initState = InitState.IN_PROGRESS
}
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 4b5d86f. suspendCompletion is now a @Volatile var reset under synchronized(initLock) whenever state flips into IN_PROGRESS — applied to both the sync initWithContext(ctx, appId) and suspend initWithContextSuspend paths to keep them consistent. Both await sites (initWithContextSuspend and waitUntilInitInternal) now local-capture the deferred under the lock so they wait on the same generation they observed.

Added regression test initWithContextSuspend resets latch on retry-after-FAILED which: forces a FAILED first init, kicks off a stalled retry, then a re-entrant suspend caller. Verified to fail on the un-hardened code with expected:<true> but was:<false> (the re-entrant caller wakes on the stale latch and reads transient IN_PROGRESS state) and pass with the fix.

Comment on lines 633 to +646
Logging.log(LogLevel.DEBUG, "initWithContext(context: $context, appId: $appId)")

// This ensures the stack trace points to the caller that triggered init, not the async worker thread.
initFailureException = IllegalStateException("OneSignal initWithContext failed.")

// Use IO dispatcher for initialization to prevent ANRs and optimize for I/O operations
return withContext(runtimeIoDispatcher) {
// do not do this again if already initialized or init is in progress
val shouldRunInit: Boolean
synchronized(initLock) {
if (initState.isSDKAccessible()) {
Logging.log(LogLevel.DEBUG, "initWithContext: SDK already initialized or in progress")
return@withContext true
shouldRunInit = !initState.isSDKAccessible()
if (shouldRunInit) {
initState = InitState.IN_PROGRESS
}
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 524c45b. initFailureException is now only assigned inside the shouldRunInit == true branch of the synchronized(initLock) block, so re-entrant callers (the SyncJobService entry point in particular) no longer overwrite the original initiator's failure-attribution stack trace.

abdulraqeeb33 pushed a commit that referenced this pull request May 7, 2026
Addresses review feedback on PR #2637 (claude[bot] + Copilot AI):

Defect 1 (stale latch on retry-after-FAILED):
suspendCompletion was a single-shot `val CompletableDeferred<Unit>`. Once
any init terminated -- including FAILED -- the deferred stayed permanently
complete. A re-entrant suspend caller arriving DURING a subsequent retry
would `await()` on the already-completed deferred, return instantly, and
read transient initState (likely IN_PROGRESS -> false), silently dropping
JobService work. Same flaw also affected `waitUntilInitInternal`.

Make suspendCompletion mutable (`@Volatile var`) and reset it under
`synchronized(initLock)` whenever state flips into IN_PROGRESS (both
sync `initWithContext(ctx, appId)` and suspend `initWithContextSuspend`).
Both await sites local-capture the deferred under the lock so they wait
on the same generation they observed -- never on a stale one.

Defect 2 (indefinite hang if internalInit throws):
internalInit had no try/catch wrapping its body. An unchecked throw from
initEssentials/bootstrapServices/subscribeToConfigStore/updateConfig/
userSwitcher.initUser/startupService.scheduleStart would leave initState
at IN_PROGRESS and suspendCompletion uncompleted forever. With the new
await() path introduced in PR #2637, every re-entrant suspend caller
(SyncJobService) would hang on await() until the OS killed the worker.

Wrap internalInit's body in try/catch: on any throw, attach the cause to
initFailureException, set initState = FAILED, call notifyInitComplete(),
and return false instead of rethrowing. Guarantees a terminal state and
released waiters on every code path.

Two new regression tests in SDKInitTests:
- "resets latch on retry-after-FAILED": stalls a retry after a forced
  FAILED first init, kicks off a re-entrant caller, asserts it doesn't
  wake on the stale latch (verified to fail on the un-hardened code with
  expected:<true> but was:<false>).
- "reaches terminal state when internalInit throws": forces a throw
  inside internalInit, asserts the suspend init returns false and the
  SDK can retry cleanly afterwards (verified to fail on the un-hardened
  code with the throw escaping as RuntimeException).

Co-authored-by: Cursor <cursoragent@cursor.com>
abdulraqeeb33 pushed a commit that referenced this pull request May 7, 2026
- SessionService.startTime: return `_time.currentTimeMillis` (~0ms elapsed)
  instead of `0L` (~58 years elapsed) when bootstrap hasn't run, so call
  sites computing `now - startTime` (IAM session-duration / SESSION_TIME
  triggers) get a sensible default rather than Jan 1970 deltas.
  (Copilot review comment)

- initWithContextSuspend: only assign `initFailureException` when the
  call actually starts init (`shouldRunInit == true`). Re-entrant callers
  no longer overwrite the original initiator's failure-attribution stack
  trace. (Copilot review comment)

- SDKInitTests: bound the `trigger.await()` inside the in-test
  ContextWrapper overrides to 30s so that a logic bug elsewhere in the
  test (or a cancelled coroutine) can't deadlock the IO worker forever
  and stall the suite. (Copilot review comment)

Co-authored-by: Cursor <cursoragent@cursor.com>
abdulraqeeb33 pushed a commit that referenced this pull request May 7, 2026
Addresses review feedback on PR #2637 (claude[bot] + Copilot AI):

Defect 1 (stale latch on retry-after-FAILED):
suspendCompletion was a single-shot `val CompletableDeferred<Unit>`. Once
any init terminated -- including FAILED -- the deferred stayed permanently
complete. A re-entrant suspend caller arriving DURING a subsequent retry
would `await()` on the already-completed deferred, return instantly, and
read transient initState (likely IN_PROGRESS -> false), silently dropping
JobService work. Same flaw also affected `waitUntilInitInternal`.

Make suspendCompletion mutable (`@Volatile var`) and reset it under
`synchronized(initLock)` whenever state flips into IN_PROGRESS (both
sync `initWithContext(ctx, appId)` and suspend `initWithContextSuspend`).
Both await sites local-capture the deferred under the lock so they wait
on the same generation they observed -- never on a stale one.

Defect 2 (indefinite hang if internalInit throws):
internalInit had no try/catch wrapping its body. An unchecked throw from
initEssentials/bootstrapServices/subscribeToConfigStore/updateConfig/
userSwitcher.initUser/startupService.scheduleStart would leave initState
at IN_PROGRESS and suspendCompletion uncompleted forever. With the new
await() path introduced in PR #2637, every re-entrant suspend caller
(SyncJobService) would hang on await() until the OS killed the worker.

Wrap internalInit's body in try/catch: on any throw, attach the cause to
initFailureException, set initState = FAILED, call notifyInitComplete(),
and return false instead of rethrowing. Guarantees a terminal state and
released waiters on every code path.

Two new regression tests in SDKInitTests:
- "resets latch on retry-after-FAILED": stalls a retry after a forced
  FAILED first init, kicks off a re-entrant caller, asserts it doesn't
  wake on the stale latch (verified to fail on the un-hardened code with
  expected:<true> but was:<false>).
- "reaches terminal state when internalInit throws": forces a throw
  inside internalInit, asserts the suspend init returns false and the
  SDK can retry cleanly afterwards (verified to fail on the un-hardened
  code with the throw escaping as RuntimeException).

Co-authored-by: Cursor <cursoragent@cursor.com>
abdulraqeeb33 pushed a commit that referenced this pull request May 7, 2026
- SessionService.startTime: return `_time.currentTimeMillis` (~0ms elapsed)
  instead of `0L` (~58 years elapsed) when bootstrap hasn't run, so call
  sites computing `now - startTime` (IAM session-duration / SESSION_TIME
  triggers) get a sensible default rather than Jan 1970 deltas.
  (Copilot review comment)

- initWithContextSuspend: only assign `initFailureException` when the
  call actually starts init (`shouldRunInit == true`). Re-entrant callers
  no longer overwrite the original initiator's failure-attribution stack
  trace. (Copilot review comment)

- SDKInitTests: bound the `trigger.await()` inside the in-test
  ContextWrapper overrides to 30s so that a logic bug elsewhere in the
  test (or a cancelled coroutine) can't deadlock the IO worker forever
  and stall the suite. (Copilot review comment)

Co-authored-by: Cursor <cursoragent@cursor.com>
@abdulraqeeb33 abdulraqeeb33 changed the base branch from main to fix/pre-existing-test-failures May 7, 2026 15:09
AR Abdul Azeez and others added 2 commits May 7, 2026 11:34
Addresses review feedback on PR #2637 (claude[bot] + Copilot AI):

Defect 1 (stale latch on retry-after-FAILED):
suspendCompletion was a single-shot `val CompletableDeferred<Unit>`. Once
any init terminated -- including FAILED -- the deferred stayed permanently
complete. A re-entrant suspend caller arriving DURING a subsequent retry
would `await()` on the already-completed deferred, return instantly, and
read transient initState (likely IN_PROGRESS -> false), silently dropping
JobService work. Same flaw also affected `waitUntilInitInternal`.

Make suspendCompletion mutable (`@Volatile var`) and reset it under
`synchronized(initLock)` whenever state flips into IN_PROGRESS (both
sync `initWithContext(ctx, appId)` and suspend `initWithContextSuspend`).
Both await sites local-capture the deferred under the lock so they wait
on the same generation they observed -- never on a stale one.

Defect 2 (indefinite hang if internalInit throws):
internalInit had no try/catch wrapping its body. An unchecked throw from
initEssentials/bootstrapServices/subscribeToConfigStore/updateConfig/
userSwitcher.initUser/startupService.scheduleStart would leave initState
at IN_PROGRESS and suspendCompletion uncompleted forever. With the new
await() path introduced in PR #2637, every re-entrant suspend caller
(SyncJobService) would hang on await() until the OS killed the worker.

Wrap internalInit's body in try/catch: on any throw, attach the cause to
initFailureException, set initState = FAILED, call notifyInitComplete(),
and return false instead of rethrowing. Guarantees a terminal state and
released waiters on every code path.

Two new regression tests in SDKInitTests:
- "resets latch on retry-after-FAILED": stalls a retry after a forced
  FAILED first init, kicks off a re-entrant caller, asserts it doesn't
  wake on the stale latch (verified to fail on the un-hardened code with
  expected:<true> but was:<false>).
- "reaches terminal state when internalInit throws": forces a throw
  inside internalInit, asserts the suspend init returns false and the
  SDK can retry cleanly afterwards (verified to fail on the un-hardened
  code with the throw escaping as RuntimeException).

Co-authored-by: Cursor <cursoragent@cursor.com>
- SessionService.startTime: return `_time.currentTimeMillis` (~0ms elapsed)
  instead of `0L` (~58 years elapsed) when bootstrap hasn't run, so call
  sites computing `now - startTime` (IAM session-duration / SESSION_TIME
  triggers) get a sensible default rather than Jan 1970 deltas.
  (Copilot review comment)

- initWithContextSuspend: only assign `initFailureException` when the
  call actually starts init (`shouldRunInit == true`). Re-entrant callers
  no longer overwrite the original initiator's failure-attribution stack
  trace. (Copilot review comment)

- SDKInitTests: bound the `trigger.await()` inside the in-test
  ContextWrapper overrides to 30s so that a logic bug elsewhere in the
  test (or a cancelled coroutine) can't deadlock the IO worker forever
  and stall the suite. (Copilot review comment)

Co-authored-by: Cursor <cursoragent@cursor.com>
@abdulraqeeb33 abdulraqeeb33 changed the base branch from fix/pre-existing-test-failures to main May 7, 2026 15:36
@abdulraqeeb33
Copy link
Copy Markdown
Contributor Author

Branch unstacked from #2605 + force-pushed.

The branch was previously rebased on top of #2605 (fix/pre-existing-test-failures), which pulled #2605's detekt violations into this PR's CI surface (12 weighted issues vs the maxIssues: 10 threshold). The two PRs don't actually depend on each other code-wise — they fix complementary problems at different layers — so I rebased back onto main directly.

What changed:

  • Base branch: fix/pre-existing-test-failuresmain.
  • Commits force-pushed:
  • Added "ReturnCount" to the @Suppress on internalInit so detekt doesn't trip on the baseline-ID mismatch caused by the @Suppress("TooGenericExceptionCaught") annotation. (The existing internalInit ReturnCount baseline entry was for the unannotated signature; once we added the annotation the ID no longer matched.)

Local verification:

  • :OneSignal:core:detekt → BUILD SUCCESSFUL (5 informational findings, all pre-existing on main, well below the maxIssues: 10 threshold).
  • :OneSignal:core:testReleaseUnitTest --tests SDKInitTests → 16/18 pass; the 2 failures are the same pre-existing main failures that fix: resolve pre-existing test failures on main #2605 is fixing (accessor instances after initWithContext without appID and initWithContext with appId does not block), unrelated to this PR.
  • All 3 regression tests in this PR pass.

When #2605 lands on main, this PR will need a future rebase to absorb the overlap in OneSignalImp.kt — that rebase will likely have conflicts since both PRs heavily modify internalInit / initWithContext. Happy to handle that once #2605 merges.

@abdulraqeeb33 abdulraqeeb33 requested a review from a team May 7, 2026 17:34
@nan-li nan-li self-requested a review May 7, 2026 20:07
@nan-li
Copy link
Copy Markdown
Contributor

nan-li commented May 7, 2026

@claude review

Comment on lines +392 to 405
} catch (t: Throwable) {
// Any unchecked throw from initEssentials/bootstrapServices/subscribeToConfigStore/
// updateConfig/userSwitcher.initUser/startupService.scheduleStart would otherwise leave
// initState at IN_PROGRESS forever and `suspendCompletion` uncompleted -- causing
// re-entrant suspend callers (e.g. SyncJobService) to hang indefinitely on `await()`.
// Always reach a terminal state and signal waiters before propagating.
Logging.error("OneSignal: internalInit threw unexpectedly; marking init FAILED", t)
initFailureException = (initFailureException ?: IllegalStateException("OneSignal initWithContext failed.")).apply {
addSuppressed(t)
}
initState = InitState.FAILED
notifyInitComplete()
return false
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Two TOCTOU races on the latch lifecycle re-introduce the very SDK-4475 failure modes this PR is fixing.\n\n**(1) Completer-side wrong-latch race.** notifyInitComplete() (L484) reads the volatile suspendCompletion at .complete() time with no lock and no local capture. All three FAILED paths in internalInit (L359-360, L379-380, L402-403) execute initState = FAILED and notifyInitComplete() as two unsynchronized statements. A concurrent retry can land between them, acquire initLock, observe FAILED → !isSDKAccessible, flip state→IN_PROGRESS and reset suspendCompletion = L2. The original thread then completes L2 (the wrong generation) instead of L1 — L1 waiters hang forever (the SyncJobService budget-slot leak this PR was meant to close), L2 waiters wake on transient IN_PROGRESS state.\n\n**(2) Awaiter-side stale-state race.** After completionToAwait!!.await() returns at L536 / L703, both call sites re-read live volatile initState outside any lock (L548 / L704). Coroutine continuation reschedule is non-trivial, and a concurrent retry-after-FAILED can flip state during that gap. In initWithContextSuspend this returns false based on transient IN_PROGRESS (silent JobService work drop); in waitUntilInitInternal the L548 FAILED check fails, control falls through past "// initState is guaranteed to be SUCCESS here" and the caller invokes getter() against a not-yet-bootstrapped service — exactly the SessionService-NPE class.\n\nBoth PR regression tests serialize the FAILED→retry sequence so neither covers the concurrent window. A single fix addresses both: change the latch payload to CompletableDeferred<InitState>, have completer paths call .complete(state) on a local captured at the start of internalInit, and have awaiter sites return based on the awaited result rather than re-reading the volatile field.

Extended reasoning...

Summary

The @volatile var + reset-under-lock + local-capture-at-await-site pattern landed in 4101068 closes the PR's targeted failure mode but leaves two unaddressed TOCTOU windows on the same latch lifecycle. Both windows are reachable via realistic host-app + SyncJobService interleavings, and both re-introduce the contract violations this PR was specifically designed to fix.

(1) Completer-side wrong-latch race

notifyInitComplete() (OneSignalImp.kt:483-485) is:

private fun notifyInitComplete() {
    suspendCompletion.complete(Unit)
}

It reads the @Volatile field at .complete() time. The await sites local-capture under initLock, but the completer side does not — so it can complete a different generation of the latch than the one it was supposed to release. The non-atomic pair initState = FAILED + notifyInitComplete() appears at three sites:

  • L358-361 (user-locked branch)
  • L378-381 (missing appId branch)
  • L402-403 (catch block)

All three live outside any synchronized block. isSDKAccessible() returns false for FAILED, so a concurrent caller landing in the synchronized block at L678-693 (or L305-315) is allowed to transition state and reset the latch.

Step-by-step proof

Initial state: initState = NOT_STARTED, suspendCompletion = D0.

  1. T_A calls initWithContextSuspend(ctx, "appId"). Synchronized block flips initState = IN_PROGRESS, allocates suspendCompletion = D1. Releases lock. Begins internalInit.
  2. T_C (re-entrant suspend caller, e.g. SyncJobService.onStartJob → initWithContext(this) → initWithContextSuspend(ctx, null)) enters the synchronized block at L678. Observes isSDKAccessible() == true (state is IN_PROGRESS), captures completionToAwait = suspendCompletion = D1. Releases lock. Suspends on D1.await() at L703.
  3. T_A: internalInit body throws (e.g. inside bootstrapServices iterating registered IBootstrapService implementations). Catch block executes at L398-404. Reaches L402: initState = InitState.FAILED. ⏸️ (thread preemption, GC pause, etc.)
  4. T_B (separate thread, e.g. host-app retry OneSignal.initWithContext(ctx, "appId")) enters the synchronized block at L305 (sync path) or L678 (suspend path). Observes initState == FAILED, !isSDKAccessible(). Flips initState = IN_PROGRESS, allocates suspendCompletion = D2. Releases lock. Spawns its own internalInit.
  5. T_A: resumes at L403 → notifyInitComplete() → reads @Volatile suspendCompletion = D2D2.complete(Unit).

Outcomes

  • D1 is never completed. T_C hangs on D1.await() forever — the exact SyncJobService budget-slot leak this PR was meant to prevent.
  • D2 is completed prematurely. Any waiter that captures D2 after T_B's reset wakes immediately. In initWithContextSuspend it returns initState == SUCCESS against transient IN_PROGRESS → false, silently dropping JobService work. In waitUntilInitInternal the L548 FAILED check fails, the function falls through past the misleading "// initState is guaranteed to be SUCCESS here" comment, and the caller proceeds to getter() against a service whose retry-init bootstrap has not run.

The SUCCESS path at L389-390 is safe because state == SUCCESS keeps isSDKAccessible() == true, so no concurrent caller can reset the latch.

(2) Awaiter-side stale-state race

The symmetric defect is on the awaiter side. After completionToAwait!!.await() returns, both call sites re-read live volatile initState outside the lock:

  • waitUntilInitInternal: L536 (await) → L548 (if (initState == InitState.FAILED))
  • initWithContextSuspend: L703 (await) → L704 (return@withContext initState == InitState.SUCCESS)

The gap between latch completion and continuation resumption depends on coroutine dispatcher scheduling and can be substantial. A concurrent retry-after-FAILED during that gap flips state from FAILED to IN_PROGRESS, defeating the post-await check. This race exists even when notifyInitComplete() correctly completes the right latch — i.e. it is independent of the bug above.

For initWithContextSuspend, returning false on transient IN_PROGRESS is a silent JobService work drop. For waitUntilInitInternal, falling through past the FAILED check on transient IN_PROGRESS is the SessionService-NPE class — the caller invokes getter() against services whose retry init has not bootstrapped (configModel.appId not set, userSwitcher.initUser not called, startupService.scheduleStart not run).

Why the PR's regression tests don't cover this

  • initWithContextSuspend resets latch on retry-after-FAILED serializes: firstInit shouldBe false runs to completion before the second init starts.
  • initWithContextSuspend reaches terminal state when internalInit throws is single-threaded.
  • The original in-flight init waits for completion test does not exercise FAILED retry at all.

Neither test launches a thread between initState = FAILED and notifyInitComplete(), and none launch a re-entrant caller between await() returning and the post-await state read.

Suggested fix

A single change addresses both races:

  1. Change the latch payload to encode the terminal state: CompletableDeferred<InitState>.
  2. At the start of internalInit, capture val myCompletion = suspendCompletion (under or just after the lock that reset it).
  3. All terminal-state paths call myCompletion.complete(InitState.FAILED) or myCompletion.complete(InitState.SUCCESS) on the captured local — never on the volatile field.
  4. Awaiter sites use the awaited result (val terminalState = completionToAwait!!.await()) instead of re-reading volatile initState.

This pattern eliminates both windows by making the terminal state of each generation observable as the latch payload, decoupling it from the live volatile field that concurrent retries mutate.

Alternatively, holding initLock across initState = FAILED and notifyInitComplete() closes window (1) but does not close window (2); window (2) requires either the result-in-payload pattern or a loop that re-captures and re-awaits while observed state is IN_PROGRESS.

Comment on lines 309 to 317
}

initState = InitState.IN_PROGRESS
// Fresh latch for this init attempt -- prevents retry-after-FAILED waiters from
// observing the prior init's already-completed deferred.
suspendCompletion = CompletableDeferred()
}

// FeatureManager depends on ConfigModelStore/PreferencesService which requires appContext.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Sync initWithContext(context, appId) resets suspendCompletion and sets initState=IN_PROGRESS inside the synchronized block (lines 305-315), then calls ensureApplicationServiceStarted(context) at line 319 outside any try/catch — the new try/catch only wraps internalInit's body. If that call throws (e.g. ApplicationService.start line 81's context.applicationContext as Application cast on a non-Application context), initState is leaked at IN_PROGRESS and the just-reset latch is never completed, causing every re-entrant suspend caller (initWithContextSuspend, waitUntilInitInternal) to hang on await() indefinitely — the exact inverted failure mode this PR's internalInit catch block was designed to prevent. Fix by mirroring the same try/catch around lines 317-332 (or moving ensureApplicationServiceStarted inside internalInit's protected body).

Extended reasoning...

What the bug is

The PR added a try/catch around internalInit's body (with a comment explicitly naming the hang failure mode it prevents) and a fresh-latch reset under synchronized(initLock) at the start of every init attempt. But on the sync initWithContext(context, appId) entry path, there is unprotected code between the latch reset and the internalInit call. That gap re-introduces the very hang the catch block was designed to prevent.

Code path

OneSignalImp.kt lines 297-333:

override fun initWithContext(context: Context, appId: String): Boolean {
    synchronized(initLock) {
        if (initState.isSDKAccessible()) { return true }
        initState = InitState.IN_PROGRESS
        suspendCompletion = CompletableDeferred()  // L1: fresh latch
    }
    ensureApplicationServiceStarted(context)       // <-- line 319, OUTSIDE try/catch
    if (isBackgroundThreadingEnabled) {
        suspendifyOnIO { internalInit(context, appId) }  // catch lives inside internalInit
        return true
    }
    return runBlocking(runtimeIoDispatcher) { internalInit(context, appId) }
}

ensureApplicationServiceStarted reaches ApplicationService.start(context) (ApplicationService.kt line 78), which does:

val application = context.applicationContext as Application  // line 81

That cast throws ClassCastException for any Context whose applicationContext is not an Application instance — uncommon in production but realistic for hosts initializing from ContentProvider contexts (some apps do this for pre-onCreate init), test/instrumentation harnesses that wrap or mock Context, or restricted/launcher contexts that return a ContextWrapper.

Why existing code does not prevent it

The PR's new try { ... } catch (t: Throwable) { ... initState = FAILED; notifyInitComplete() ... } is scoped to internalInit's body (OneSignalImp.kt lines ~360-405). It does not wrap the lines between the synchronized block and the internalInit invocation in the sync entry path. The PR comment on that catch even enumerates the exact symptom — "would otherwise leave initState at IN_PROGRESS forever and suspendCompletion uncompleted -- causing re-entrant suspend callers (e.g. SyncJobService) to hang indefinitely on await()" — but the protection is one stack frame too narrow.

isBackgroundThreadingEnabled itself is already defensively wrapped (catches Throwable from the feature-manager lookup), so the only realistic throw point in the unprotected window is ensureApplicationServiceStarted.

Step-by-step proof

  1. Host MainApplication.onCreate() calls OneSignal.initWithContext(weirdContext, "appId") where weirdContext.applicationContext is a ContextWrapper (not an Application).
  2. Synchronized block (lines 305-315): initState = IN_PROGRESS, suspendCompletion = L1 (fresh).
  3. Line 319: ensureApplicationServiceStarted(weirdContext)(applicationService as ApplicationService).start(weirdContext)val application = weirdContext.applicationContext as Application throws ClassCastException.
  4. Throw escapes initWithContext. SDK state: initState=IN_PROGRESS, suspendCompletion=L1 uncompleted, applicationServiceStarted=false. Host catches the throw (or the process keeps running because Application onCreate exceptions are sometimes survivable, or a persisted JobService entry outlives the throwing process invocation).
  5. Later, SyncJobService.onStartJob fires and calls OneSignal.initWithContext(this) (suspend, no-appId) → initWithContextSuspend(ctx, null).
  6. Synchronized block at lines 678-693: isSDKAccessible() is true (IN_PROGRESS counts). shouldRunInit=false, captures completionToAwait = suspendCompletion = L1.
  7. Line 703: completionToAwait!!.await()blocks forever on L1, which no code path will ever complete because no internalInit was invoked for this generation.
  8. SyncJobService coroutine hangs until the OS kills the worker, holding its JobService budget slot — the same hang failure mode the PR's catch block was supposed to prevent.

The same hang affects waitUntilInitInternal (line 533): any subsequent accessor call (os.user, os.notifications, etc.) under SDK_BACKGROUND_THREADING blocks forever in runBlocking { waitUntilInitInternal() } -> await() on L1.

Why this is a regression introduced by this PR

  • Pre-PR: suspendCompletion was a single-shot val; initWithContextSuspend did not await() on it. A re-entrant suspend caller observing isSDKAccessible() == true early-returned true without awaiting (the SDK-4475 NPE this PR fixes). A leaked latch was harmless because nothing awaited.
  • Post-PR: re-entrant suspend callers now await() on the freshly-reset latch (line 703), and waitUntilInitInternal does the same (line 533). A leaked latch from a pre-internalInit throw turns the prior NPE-on-stale-state into an indefinite hang. The bug class shifts from "returns too early" to "never returns" — exactly the inversion the PR's internalInit catch was designed to prevent.

Impact

Trigger probability is narrow (most hosts pass an Activity / Application / Service context whose applicationContext is the Application), but the consequence is severe and matches the precise hang the PR is attempting to close. SyncJobService callers hold JobService budget slots until the OS terminates the worker; accessor calls under SDK_BACKGROUND_THREADING will block forever. Importantly, the PR's defensive intent is explicit in the new catch comment, so this is a literal gap in the same defensive pattern rather than a hypothetical future hardening.

Suggested fix

Mirror the internalInit catch around the sync-entry body. Two equivalent options:

Option A — wrap the post-lock body:

synchronized(initLock) { ... }
try {
    ensureApplicationServiceStarted(context)
    if (isBackgroundThreadingEnabled) {
        suspendifyOnIO { internalInit(context, appId) }
        return true
    }
    return runBlocking(runtimeIoDispatcher) { internalInit(context, appId) }
} catch (t: Throwable) {
    initFailureException = (initFailureException ?: IllegalStateException(...)).apply { addSuppressed(t) }
    initState = InitState.FAILED
    notifyInitComplete()
    throw t
}

Option B — move ensureApplicationServiceStarted into internalInit so it falls under the existing try/catch. Would need to preserve the order with the isBackgroundThreadingEnabled evaluation (which currently relies on applicationServiceStarted being true to avoid returning false prematurely).

Comment on lines +685 to +691
initState = InitState.IN_PROGRESS
// Fresh latch for this init attempt.
suspendCompletion = CompletableDeferred()
// Only the call that actually starts init owns the failure-attribution exception.
// Re-entrant callers must not overwrite it -- otherwise the failure stack trace
// would point at the SyncJobService coroutine instead of the original initiator.
initFailureException = IllegalStateException("OneSignal initWithContext failed.")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit / pre-existing-style: The fix to gate initFailureException assignment behind shouldRunInit (responding to Copilot 3196748195) inadvertently moved the IllegalStateException(...) constructor inside withContext(runtimeIoDispatcher). The field's declaration comment at OneSignalImp.kt:62 still says "Save the exception pointing to the caller that triggered init, not the async worker thread," and the deleted comment at the old assignment site ("This ensures the stack trace points to the caller that triggered init, not the async worker thread") spelled out the same intent — both are now technically violated. Practical impact is small (the message + suppressed cause carry the real diagnostic info, and on suspend paths the "caller frames" were always coroutine continuation frames anyway), but if you want to honor the documented invariant the construction can be hoisted above withContext while the assignment stays under the lock — i.e. val newFailureEx = IllegalStateException(...) outside, then initFailureException = newFailureEx inside the shouldRunInit branch.

Extended reasoning...

What changed

Pre-PR (per the diff context), initWithContextSuspend constructed and assigned initFailureException immediately on entry, before withContext(runtimeIoDispatcher):

// pre-PR
initFailureException = IllegalStateException("OneSignal initWithContext failed.")
// This ensures the stack trace points to the caller that triggered init, not the async worker thread.
return withContext(runtimeIoDispatcher) { ... }

Post-PR (OneSignalImp.kt:691), the assignment lives inside the shouldRunInit branch of the synchronized(initLock) block — which is itself inside withContext(runtimeIoDispatcher). The constructor therefore now runs on an IO worker continuation, and the captured stack trace contains IO dispatcher frames rather than the caller's coroutine continuation frames. The pre-PR comment that documented the invariant at the assignment site was deleted, but the field-level comment at OneSignalImp.kt:62 still asserts it: Save the exception pointing to the caller that triggered init, not the async worker thread.

Why the change happened

This was a deliberate response to Copilot inline-comment 3196748195: initFailureException is overwritten at the start of every initWithContextSuspend call, even when this invocation does not actually start init. The fix correctly gates the assignment so re-entrant suspend callers (notably the SyncJobService entry point) no longer overwrite the original initiator's failure-attribution exception. That part is sound. The accidental side effect is that hoisting the assignment under the lock also relocated the constructor into the IO continuation.

Practical impact (why this is a nit, not a blocker)

  1. The substantive diagnostic info is preserved. The exception's message (OneSignal initWithContext failed.) is unchanged, and the suppressed cause attached in internalInit (the addSuppressed paths at lines ~691 and ~702) carries the actual failure-site stack into the bootstrap path. Crash reporters and debuggers surfacing this exception still get the meaningful information.
  2. Pre-PR, the "caller's stack" guarantee was already weak on the suspend path. initWithContextSuspend is a suspend function — its callers are already in a coroutine, so the JVM stack at the constructor call site (whether before or after withContext) is dominated by continuation/dispatcher machinery; the user's actual code site (e.g. launch { os.initWithContextSuspend(...) }) is not present at any point. The deleted comment overstated what was actually preserved.
  3. The typical sync initWithContext(ctx, appId) path is unaffected — that path doesn't touch initFailureException at all (only the catch block in internalInit does, on whichever runner thread).
  4. The existing test accessor instances after initWithContext without appID shows the failure reason (SDKInitTests.kt:75-83) only asserts on .message and suppressed[0].message, not on the stack trace, so the regression is not test-detected.

Step-by-step proof of the regression

  1. Host calls os.initWithContextSuspend(context) from inside a coroutine on the Main dispatcher (or from runBlocking on a non-coroutine thread).
  2. Execution enters initWithContextSuspend body. Pre-PR, IllegalStateException(...) is constructed here — its captured stack contains the coroutine's continuation frames on the Main dispatcher (or the runBlocking thread). Post-PR, no construction happens yet.
  3. withContext(runtimeIoDispatcher) switches to an IO dispatcher worker. Post-PR, the synchronized block runs here, and IllegalStateException(...) is constructed now — its captured stack contains IO worker continuation frames.
  4. Init eventually fails. Some later waitUntilInitInternal caller throws initFailureException. Pre-PR: stack trace shows Main dispatcher / runBlocking-thread continuation. Post-PR: stack trace shows IO worker continuation.
  5. Both stacks are coroutine machinery; neither cleanly points at the host app's call site. But pre-PR was at least consistent with the documented invariant of "caller's dispatcher," whereas post-PR is unambiguously "IO worker."

Suggested fix (if you want to honor the documented invariant)

Hoist the constructor above withContext and only the assignment under the lock. The two concerns (no-overwrite and caller-dispatcher capture) are orthogonal:

override suspend fun initWithContextSuspend(...): Boolean {
    Logging.log(...)
    // Capture stack on caller's dispatcher.
    val newFailureEx = IllegalStateException("OneSignal initWithContext failed.")
    return withContext(runtimeIoDispatcher) {
        synchronized(initLock) {
            if (initState.isSDKAccessible()) { ... }
            else {
                initState = InitState.IN_PROGRESS
                suspendCompletion = CompletableDeferred()
                initFailureException = newFailureEx  // assignment under lock; stack captured outside withContext
            }
        }
        ...
    }
}

Or alternately: drop the field-level comment to match what's actually preserved (message + suppressed cause), since the "caller frames" claim was already largely aspirational on a suspend path. Either fix is fine; the current code is internally inconsistent.

Comment on lines +345 to +347
// existing detekt baseline entry was for the unannotated signature; once we added
// @Suppress(TooGenericExceptionCaught) the baseline ID no longer matched, so we
// suppress ReturnCount inline rather than churning the baseline.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, autogenerated comments can be quite verbose and not that relevant, and I'm quite guilty of this too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants